{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "from langchain.document_loaders import PyPDFLoader\n",
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "from langchain.vectorstores import Chroma\n",
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "OPENAI_API_KEY = getpass()\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY\n",
    "\n",
    "embedding = OpenAIEmbeddings()\n",
    "\n",
    "\n",
    "# Load pdf\n",
    "loader = PyPDFLoader(\"E:\\\\langchain_RAG\\\\data\\\\baichuan.pdf\")\n",
    "data = loader.load()\n",
    "\n",
    "# Split\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
    "splits = text_splitter.split_documents(data[:6])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.retrievers import BM25Retriever, EnsembleRetriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "bm25_retriever = BM25Retriever.from_documents(\n",
    "    documents=splits\n",
    ")\n",
    "bm25_retriever.k = 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "vectordb = Chroma.from_documents(documents=splits, embedding=embedding)\n",
    "\n",
    "retriever = vectordb.as_retriever(search_kwargs={\"k\": 4})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "ensemble_retriever = EnsembleRetriever(\n",
    "    retrievers=[bm25_retriever, retriever], weights=[0.5, 0.5]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Baichuan 2 excels in vertical domains such\\nas medicine and law. We will release all\\npre-training model checkpoints to benefit the\\nresearch community in better understanding\\nthe training dynamics of Baichuan 2.\\n1 Introduction\\nThe field of large language models has witnessed\\npromising and remarkable progress in recent years.\\nThe size of language models has grown from\\nmillions of parameters, such as ELMo (Peters\\net al., 2018), GPT-1 (Radford et al., 2018), to', metadata={'page': 0, 'source': 'E:\\\\langchain_RAG\\\\data\\\\baichuan.pdf'}),\n",
       " Document(page_content='daniel@baichuan-inc.com.ChatGPT (OpenAI, 2022) from OpenAI, the power\\nof these models to generate human-like text has\\ncaptured widespread public attention. ChatGPT\\ndemonstrates strong language proficiency across\\na variety of domains, from conversing casually to\\nexplaining complex concepts. This breakthrough\\nhighlights the potential for large language models\\nto automate tasks involving natural language\\ngeneration and comprehension.\\nWhile there have been exciting breakthroughs', metadata={'page': 0, 'source': 'E:\\\\langchain_RAG\\\\data\\\\baichuan.pdf'}),\n",
       " Document(page_content='and applications of LLMs, most leading LLMs like\\nGPT-4 (OpenAI, 2023), PaLM-2 (Anil et al., 2023),\\nand Claude (Claude, 2023) remain closed-sourced.\\nDevelopers and researchers have limited access to\\nthe full model parameters, making it difficult for\\nthe community to deeply study or fine-tune these\\nsystems. More openness and transparency around\\nLLMs could accelerate research and responsible\\ndevelopment within this rapidly advancing field.\\nLLaMA (Touvron et al., 2023a), a series of large', metadata={'source': 'E:\\\\langchain_RAG\\\\data\\\\baichuan.pdf', 'page': 0}),\n",
       " Document(page_content='Baichuan 2: Open Large-scale Language Models\\nAiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan\\nDian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai\\nGuosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji\\nJian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma\\nMang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun', metadata={'page': 0, 'source': 'E:\\\\langchain_RAG\\\\data\\\\baichuan.pdf'}),\n",
       " Document(page_content='billions or even trillions of parameters such as GPT-\\n3 (Brown et al., 2020), PaLM (Chowdhery et al.,\\n2022; Anil et al., 2023) and Switch Transformers\\n(Fedus et al., 2022). This increase in scale has\\nled to significant improvements in the capabilities\\nof language models, enabling more human-like\\nfluency and the ability to perform a diverse range\\nof natural language tasks. With the introduction of\\nAuthors are listed alphabetically, correspondent:', metadata={'source': 'E:\\\\langchain_RAG\\\\data\\\\baichuan.pdf', 'page': 0}),\n",
       " Document(page_content='feature engineering. However, most powerful\\nLLMs are closed-source or limited in their\\ncapability for languages other than English. In\\nthis technical report, we present Baichuan 2,\\na series of large-scale multilingual language\\nmodels containing 7 billion and 13 billion\\nparameters, trained from scratch, on 2.6 trillion\\ntokens. Baichuan 2 matches or outperforms\\nother open-source models of similar size on\\npublic benchmarks like MMLU, CMMLU,\\nGSM8K, and HumanEval. Furthermore,', metadata={'page': 0, 'source': 'E:\\\\langchain_RAG\\\\data\\\\baichuan.pdf'})]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs = ensemble_retriever.invoke(\"What is baichuan2 ？\")\n",
    "docs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
